Excel spreadsheet


Recycling the Web: A Method to Enhance Pre-training Data Quality and Quantity for Language Models

arXiv.org Artificial Intelligence

Scaling laws predict that the performance of large language models improves with increasing model size and data size. In practice, pre-training has been relying on massive web crawls, using almost all data sources publicly available on the internet so far. However, this pool of natural data does not grow at the same rate as the compute supply. Furthermore, the availability of high-quality texts is even more limited: data filtering pipelines often remove up to 99% of the initial web scrapes to achieve state-of-the-art performance. To address the "data wall" of pre-training scaling, our work explores ways to transform and recycle data discarded in existing filtering processes. We propose REWIRE, REcycling the Web with guIded REwrite, a method to enrich low-quality documents so that they could become useful for training. This in turn allows us to increase the representation of synthetic data in the final pre-training set. Experiments at 1B, 3B and 7B scales of the DCLM benchmark show that mixing high-quality raw texts and our rewritten texts leads to 1.0, 1.3 and 2.5 percentage points improvement respectively across 22 diverse tasks, compared to training on only filtered web data. Training on the raw-synthetic data mix is also more effective than having access to 2x web data. Through further analysis, we demonstrate that about 82% of the mixed-in texts come from transforming lower-quality documents that would otherwise be discarded. REWIRE also outperforms related approaches of generating synthetic data, including Wikipedia-style paraphrasing, question-answer synthesizing and knowledge extraction. These results suggest that recycling web texts holds the potential for being a simple and effective approach for scaling pre-training data. We make our high-quality synthetic data publicly available at https://huggingface.co/datasets/facebook/recycling_the_web.


How AI is improving warehouse performance and easing supply chain disruptions

#artificialintelligence

Unlocking greater performance gains in warehouses using artificial intelligence (AI) and machine learning (ML) helps make supply chains more resilient and capable of bouncing back faster from disruptions. Unfortunately, the severity and frequency of supply chain disruptions are increasing, with McKinsey finding that, on average, companies experience a disruption of one to two months in duration every 3.7 years. Over a decade, the financial fallout of supply chain disruptions in the consumer goods sector can equal 30% of a year's earnings before interest, taxes, depreciation and amortization (EBITDA).


Could No-Code Enable Everything Ops?

#artificialintelligence

It feels like DevOps principles are permeating every discipline, creating new buzzwords by the minute. This "JargonOps" is clearly encouraged by marketing campaigns (and bloggers, wink, wink). Yet, the phrases do depict a real trend: all industries are getting an efficiency overhaul in the wake of increased automation. As I've covered before, low-code and no-code tools lower the barrier to entry to application development, enabling field experts to construct workflows as they see fit. For tech-savvy non-engineers, this could be a huge boon to transform copy-and-paste stopgaps into efficient workflow automations.


Can an artificial intelligence bot make a 1% daily profit in cryptocurrency?

#artificialintelligence

I've been watching this trade pop up all over my newsfeed and I am so intrigued. I have a background in computer science and programming and I know enough to be dangerous, but this seems like it would take some serious skill. So what is the deal with automated trading bots? How can a bot make 1% daily profit? This post will answer all of your questions about automating cryptocurrency trading with artificial intelligence (AI) bots.
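To put the 1% daily figure in perspective, here is a minimal sketch of the compounding arithmetic it implies. It assumes profits are fully reinvested every day and ignores fees, losses, and slippage; the function name `compound_growth` is made up for illustration and does not come from the article.

```python
def compound_growth(daily_rate: float, days: int) -> float:
    """Return the multiple of the starting balance after `days` of compounding.

    Assumption: the same rate is earned every single day and all profit
    is reinvested, which real trading is unlikely to deliver.
    """
    return (1.0 + daily_rate) ** days


# Growth multiple after one month and one year at 1% per day.
month_multiple = compound_growth(0.01, 30)
year_multiple = compound_growth(0.01, 365)

print(f"30 days:  {month_multiple:.2f}x the starting balance")
print(f"365 days: {year_multiple:.1f}x the starting balance")
```

A steady 1% per day compounds to roughly 37.8 times the starting balance over a year, which gives a sense of how demanding the claim is.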


The Current State of Intelligent Automation

#artificialintelligence

Successful digital transformation requires many things, among them automation for speed and efficiency and better customer, partner, and employee experiences, whether that means answering questions faster, automating document review, or ensuring that customers' bank accounts are safe. Traditionally, automation was achieved through manual scripting, but more recently visual low-code and no-code tools have helped democratize the creation of automated tasks. Intelligent automation uses AI to go beyond what's possible with deterministically programmed if-then-else systems. Bear in mind that the inclusion of AI is not necessary or even appropriate for all use cases.


How this Startup Uses AI to Automate Lease Accounting

#artificialintelligence

In theory, few industries are set up for the disruptive powers of AI as perfectly as accounting. The work of most modern accountants involves retrieving, presenting, and analyzing data from a range of transactions, which are essentially repeated over and over. All the while, they've got to ensure that they've complied with the relevant accounting standards. Their work is, with respect to all of the fine accountants out there, the bread and butter of artificial intelligence. However, until recently, disruption in accounting was impeded by the depth and complexity of paperwork involved in some areas.


Software to accelerate R&D

#artificialintelligence

Many scientists and researchers still rely on Excel spreadsheets and lab notebooks to manage data from their experiments. That can work for single experiments, but companies tend to make decisions based on data from multiple experiments, some of which may take place at different labs, with slightly different parameters, and even in different countries. The situation often requires scientists to leave the lab bench to spend time gathering and merging data from various experiments. Teams of scientists may also struggle to know what the others have tried and which avenues of research still hold promise. Now the startup Uncountable has developed a digital workbook to help scientists get more from experimental data.


How to Get the Most out of Excel with Machine Learning

#artificialintelligence

Excel is perhaps the best-known data analysis tool out there. It's used to store and organize data such as sales numbers, profit rates, expenditures or revenues. Some businesses even use it to store text data. However, Excel is unable to organize text data without the help of machine learning. Machine learning algorithms can automatically analyze hundreds or thousands of rows of text data in a fast, consistent and scalable way. In other words, machine learning algorithms are able to quantify words and phrases in Excel by assigning topics, keywords, entities, and even sentiment to each row of text.
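As a rough illustration of what assigning keywords to each row of text can look like, the sketch below uses a plain word-frequency count. This is a simple heuristic standing in for a trained model, not the machine learning approach the article refers to. The sample rows, the stopword list, and the function names `top_keywords` and `tag_rows` are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "is", "and", "to", "was", "it", "of", "but"}


def top_keywords(text, k=2):
    """Return the k most frequent non-stopword tokens in one row of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]


def tag_rows(rows, k=2):
    """Pair each row of text with its top keywords."""
    return [(row, top_keywords(row, k)) for row in rows]


# Made-up rows, standing in for one text column exported from a spreadsheet.
rows = [
    "Shipping was slow, slow, slow and the box was damaged",
    "Great price, great support, great product",
]

for row, keywords in tag_rows(rows, k=1):
    print(keywords, "<-", row)
```

In practice the rows would be read from the workbook itself, and the frequency count would be replaced by a topic, entity, or sentiment model.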


What Is the Pandas in Machine learning?

#artificialintelligence

Machine learning is a complex discipline. Implementing machine learning models is now much easier than it used to be, thanks to libraries such as pandas. Wait, isn't a panda an animal? That was my reaction in a data science class, but by the end of the class I had fully grasped the concept of pandas. Pandas is an open-source library, free to use (under the BSD license), and it was originally written by Wes McKinney back in 2009.
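For readers who have not used the library, a minimal sketch of typical pandas usage follows. It assumes pandas is installed; the column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sales table; every value here is made up for the example.
df = pd.DataFrame({
    "region": ["north", "south", "north", "south"],
    "sales": [120, 80, 100, 60],
})

# Group the rows by region and average the sales figure within each group.
mean_sales = df.groupby("region")["sales"].mean()

print(mean_sales)
```

Tabular summaries like this are a common preparation step before data is passed to a machine learning model.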


How to use game-changing AI to boost decision quality

#artificialintelligence

And show why Game AI for business decisions makes so much sense. In the debate about AI for business, there is a lot of focus on the data, but data by itself doesn't generate value. Businesses generate value by turning data into decisions.